Abstract

An ordered labeled tree is a tree in which the nodes are labeled and the left-to-right order among 
siblings is relevant. The edit distance between two ordered labeled trees is the minimum cost of changing 
one tree into the other through a sequence of edit steps. In the literature, there is a class of algorithms
based on different yet closely related path-decomposition schemes. This article reviews the principles of 
these algorithms, and studies the concepts related to the algorithmic complexities as a consequence of 
the decomposition schemes. 


1 Introduction 

An ordered labeled tree is a tree in which the nodes are labeled and the left-to-right order among siblings is 
significant. 

The tree edit distance metric was introduced by Tai as a generalization of the string editing problem [12].
Given two trees $T_1$ and $T_2$, the tree edit distance between $T_1$ and $T_2$ is the minimum cost to change one
tree into the other by a sequence of edit steps. Tai [12] gave an algorithm with a time complexity of
$O(|T_1|^3 \times |T_2|^3)$. Subsequently, a number of improved algorithms were developed [13], [7], [3].
Bille [1] presented a survey on the tree edit distance algorithms. This article focuses on a class of algorithms
that are based on closely related dynamic programming approaches, developed by Zhang and Shasha [13],
Klein [7], and Demaine et al. [3], with time complexities of
$O(|T_1| \times |T_2| \times \prod_{i=1}^{2} \min\{depth(T_i), \#leaves(T_i)\})$,
$O(|T_1|^2 \times |T_2| \times \log |T_2|)$, and $O(|T_1|^2 \times |T_2| \times (1 + \log \frac{|T_2|}{|T_1|}))$, respectively.
The essential features common to these algorithms are:

1. a postorder enumeration of the subproblems, 

2. the recursive partitioning of trees into disjoint paths, each associated with a separate subtree-subtree 
distance computation. 

The notions related to these paths as a result of the recursive partitioning were formalized by Dulucq and 
Touzet [5] , and referred to as “decomposition strategies”. The algorithm by Demaine et al. yields the best 
worst-case time complexity. They also showed that there exist trees for which
$\Omega(|T_1|^2 \times |T_2| \times (1 + \log \frac{|T_2|}{|T_1|}))$
time is required to compute the distance no matter what strategy is used.

In this article, we review and study the concepts underlying various algorithmic approaches based on 
“decomposition strategies” as well as their impacts on the time complexity in computing the tree edit 
distance. 

The article is organized as follows. Section [2] introduces the problem of tree edit distance, and gives some 
initial solutions based on naive strategies. Section [3] presents improved strategies, focusing on the conceptual 
aspects related to the time complexities. Section [4] gives concluding remarks.


2 Preliminaries


Before we study the tree edit distance problem, it would be beneficial to recall the solution for string edit 
distance because the tree problem is a generalization of the string problem, and the solution for the tree 
problem may be constructed in ways analogous to the string problem. The string edit distance $d(S_1, S_2)$ can
be solved by Equation [1], where $u$ and $v$ may be both the last elements or both the first elements of $(S_1, S_2)$. The
three basic edit steps are substitution, deletion, and insertion, with respective costs $\delta(u, v)$, $\delta(u, \emptyset)$,
and $\delta(\emptyset, v)$.

Definition 1 (String Edit Distance). The edit distance $d(S_1, S_2)$ between two strings $S_1$ and $S_2$ is the
minimum cost to change $S_1$ to $S_2$ via a sequence of basic edit steps.

$$
d(S_1, S_2) = \min
\begin{cases}
d(S_1 - u, S_2) + \delta(u, \emptyset), \\
d(S_1, S_2 - v) + \delta(\emptyset, v), \\
d(S_1 - u, S_2 - v) + \delta(u, v)
\end{cases}
\tag{1}
$$
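The string recurrence can be evaluated bottom-up over prefixes. The following is a minimal sketch, assuming that $u$ and $v$ are the last elements and that the costs are unit costs by default; the function name `string_edit_distance` and its parameters are illustrative and not part of the original text.

```python
def string_edit_distance(s1, s2, sub_cost=None, indel_cost=1):
    """Evaluate the string recurrence bottom-up, peeling off last elements."""
    if sub_cost is None:
        # Assumed default: substitution costs 1 unless the symbols match.
        sub_cost = lambda u, v: 0 if u == v else 1
    m, n = len(s1), len(s2)
    # d[i][j] is the distance between the prefixes s1[:i] and s2[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + indel_cost  # delete everything in s1[:i]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + indel_cost  # insert everything in s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            u, v = s1[i - 1], s2[j - 1]
            d[i][j] = min(
                d[i - 1][j] + indel_cost,         # delete u
                d[i][j - 1] + indel_cost,         # insert v
                d[i - 1][j - 1] + sub_cost(u, v)  # substitute u by v
            )
    return d[m][n]
```

Each table entry depends only on three previously computed entries, which is the property the forest recurrence later generalizes.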

We now turn to the tree edit distance. First, we define some basic notations that will be useful in the 
rest of the article. 

Given a tree $T$, we denote by $r(T)$ its root and by $t[i]$ the $i$th node in $T$. The subtree rooted at $t[i]$ is denoted
by $T[i]$. Denote by $F \circ G$ the left-to-right concatenation of $F$ and $G$. The notation $F - T$ represents the
structure resulting from removing $T$ from $F$.

Definition 2 (Tree Edit Distance). The edit distance $d(T_1, T_2)$ between two trees $T_1$ and $T_2$ is the minimum
cost to change $T_1$ to $T_2$ via a sequence of basic edit steps.

Analogous to string editing, there are three basic edit operations on a tree: substitution, with cost
$\delta(t_1, t_2)$; insertion, with cost $\delta(\emptyset, t_2)$; and deletion, with cost $\delta(t_1, \emptyset)$. The substitution
operation substitutes a tree node with another one. The insertion operation inserts a node into a tree. If
the inserted node is made a child of some node in the tree, the children of this node become the children of
the inserted node. The deletion operation deletes a node from a tree, and the children of the deleted node
become the children of the parent of the deleted node. These operations are displayed in Figure [1].

The set of substitution steps can be represented as a mapping relation satisfying the following conditions: 

1. One-to-one mapping: A node in one tree can be mapped to at most one node in another tree. 

2. Sibling order is preserved: For any two substitution steps $(t_1[i] \to t_2[j])$ and $(t_1[i'] \to t_2[j'])$ in the edit
script, $t_1[i]$ is to the left of $t_1[i']$ if and only if $t_2[j]$ is to the left of $t_2[j']$ (see Figure 2(a)).

3. Ancestor order is preserved: For any two substitution steps $(t_1[i] \to t_2[j])$ and $(t_1[i'] \to t_2[j'])$ in the
edit script, $t_1[i]$ is an ancestor of $t_1[i']$ if and only if $t_2[j]$ is an ancestor of $t_2[j']$ (see Figure 2(b)).

As a consequence of these conditions, the substitution steps are consistent with the structural hierarchy 
in the original trees. 
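The three conditions can be checked mechanically. The sketch below assumes a tree is written as a pair `(label, children)` with `children` a tuple of trees, and that nodes are identified by their 1-based LR-postorder numbers; the helper names are illustrative only. It uses the standard facts that a node is an ancestor of another when it comes earlier in preorder and later in postorder, and is to the left of another when it comes earlier in both orders.

```python
def preorder_ranks(tree):
    """Map each LR-postorder id (1-based) to the node's preorder rank."""
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        rank = counters["pre"]
        for child in node[1]:
            visit(child)
        counters["post"] += 1
        ranks[counters["post"]] = rank

    visit(tree)
    return ranks


def is_ancestor(pre, x, y):
    # x is a proper ancestor of y: earlier in preorder, later in postorder.
    return pre[x] < pre[y] and x > y


def is_left_of(pre, x, y):
    # x is to the left of y: earlier in both preorder and postorder.
    return pre[x] < pre[y] and x < y


def is_valid_mapping(tree1, tree2, pairs):
    """Check the one-to-one, sibling-order and ancestor-order conditions."""
    pre1, pre2 = preorder_ranks(tree1), preorder_ranks(tree2)
    if len({i for i, _ in pairs}) != len(pairs):
        return False
    if len({j for _, j in pairs}) != len(pairs):
        return False
    for i, j in pairs:
        for i2, j2 in pairs:
            if is_left_of(pre1, i, i2) != is_left_of(pre2, j, j2):
                return False
            if is_ancestor(pre1, i, i2) != is_ancestor(pre2, j, j2):
                return False
    return True
```

For example, with a tree `c(a, b)` and a tree `g(d, e, f)`, mapping `a` to `f` and `b` to `d` is rejected because it reverses the sibling order.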

For the class of algorithms that we consider, the solution for tree edit distance is based on the recursive
formula for forest edit distance in Equation [2].

$$
d(F, G) = \min
\begin{cases}
d(F - r(T), G) + \delta(r(T), \emptyset), \\
d(F, G - r(T')) + \delta(\emptyset, r(T')), \\
d(F - T, G - T') + d(T, T')
\end{cases}
\tag{2}
$$

A forest, as a sequence of subtrees, resembles a string if each subtree is viewed as a single element.
A string can be represented as a sequence, or an ordered set, of labeled nodes. A forest reduces
to a string when each subtree contains a single node. In this view, the problem of forest distance may be
approached in ways analogous to the string distance problem, and the solution would be a generalization
of the string solution. Such a solution rests on the same principle as in the string
case: if we know the solutions of the subproblems, each of which is derived from the original
problem by one of the three aforementioned basic operations, then the solution of the original problem can be
constructed from the solutions of these subproblems by a finite number of simple arithmetic operations. The
same principle holds recursively for all the subproblems. The tree-to-tree distance $d(T, T')$ in Equation [2] is
computed as in Equation [3]. When both forests are composed of one tree (i.e., $(F, G) = (T, T')$),
Equation [2] reduces to Equation [3], which in turn makes use of Equation [2] for computing the associated
subforest distances.

$$
d(T, T') = \min
\begin{cases}
d(T - r(T), T') + \delta(r(T), \emptyset), \\
d(T, T' - r(T')) + \delta(\emptyset, r(T')), \\
d(T - r(T), T' - r(T')) + \delta(r(T), r(T'))
\end{cases}
\tag{3}
$$

Figure 1 Basic tree edit operations: (a) substitution $(a \to b)$, (b) deletion $(c \to \emptyset)$, (c) insertion $(\emptyset \to c)$.

Figure 2 Tree editing conditions that preserve sibling orders and ancestor orders.

The recursion in Equation [2] takes on two possible directions (see Figure [3]):

• leftmost recursion where both $r(T)$ and $r(T')$ are leftmost roots,

• rightmost recursion where both $r(T)$ and $r(T')$ are rightmost roots.

Figure 3 Recursion directions: (c-1) leftmost substitution, (c-2) rightmost substitution.


There are a few things to note regarding the above formulae. First, we need all the subtree-subtree 
distances in order to construct the solution. That is, given $Q_a = \{T_a[i] \mid t_a[i] \in T_a\}$ with $a \in \{1, 2\}$, we
need to compute all the distances for $Q_1 \times Q_2$. Since we are solving an optimization problem, the result is
optimal only if all possible cases have been considered from which the optimal one is selected. This means 
that all combinations of node-to-node mappings which satisfy the editing conditions need to be considered, 
which translates into the need for computing all subtree-subtree distances. Second, the direction of recursion 
has an influence on which subforests would be relevant in the construction of the solution. These are the 
subforests that would appear in the recursive calls. 


Definition 3 (Relevant Subforests). The “relevant subforests” with respect to a tree edit distance solution 
are those subforests that appear in the recursive calls in Equation [2].
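To make the notion of relevant subforests concrete, the following sketch evaluates the forest recurrence top-down with rightmost recursion and records every pair of forests that appears in a recursive call. It assumes a forest is a tuple of trees, a tree is a pair `(label, children)` with `children` a tuple, and unit costs by default; the names `forest_distance` and `unit_cost` are illustrative and not taken from the original text.

```python
def unit_cost(a, b):
    """Assumed cost model: None stands for the empty node."""
    if a is None or b is None:
        return 1
    return 0 if a == b else 1


def forest_distance(F, G, delta=unit_cost):
    """Return (distance, memo) where memo is keyed by the forest pairs visited."""
    memo = {}

    def d(F, G):
        if not F and not G:
            return 0
        key = (F, G)
        if key in memo:
            return memo[key]
        if not G:
            T = F[-1]
            result = d(F[:-1] + T[1], G) + delta(T[0], None)
        elif not F:
            T2 = G[-1]
            result = d(F, G[:-1] + T2[1]) + delta(None, T2[0])
        else:
            T, T2 = F[-1], G[-1]  # rightmost trees of F and G
            # F - r(T): the children of the rightmost root take its place.
            delete = d(F[:-1] + T[1], G) + delta(T[0], None)
            insert = d(F, G[:-1] + T2[1]) + delta(None, T2[0])
            if len(F) == 1 and len(G) == 1:
                # Both forests are single trees: the tree-to-tree case.
                match = d(T[1], T2[1]) + delta(T[0], T2[0])
            else:
                match = d(F[:-1], G[:-1]) + d((T,), (T2,))
            result = min(delete, insert, match)
        memo[key] = result
        return result

    return d(F, G), memo
```

The first components of the keys in `memo` are the relevant subforests of the first tree under rightmost recursion. This direct recursion is only meant to illustrate the definition; the enumeration schemes discussed next organize the same subproblems bottom-up.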


Figure 4(a) and Figure 4(b) show examples of relevant subforests generated from rightmost recursion.
Figure 5(a) and Figure 5(b) show examples of relevant subforests generated from a recursion that operates
on the left side and right side intermittently with respect to a predefined path. More details will be given in
the next section regarding this type of recursion.

Figure 4 An example showing the relevant subforests from the rightmost recursion with respect to the
leftmost path. (b) relevant subforests resulting from deletions and substitutions.

Figure 5 An example showing the relevant subforests from a recursion that operates on the left side and
right side intermittently with respect to a predefined path. (a) relevant subforests resulting from successive
deletions, (b) relevant subforests resulting from deletions and substitutions.

In constructing an algorithmic solution based on Equation [2], there are two complementary aspects to
consider: 

• Top-down aspect: This concerns the direction from the left-hand side to the right-hand side of the 
recursion. 

• Bottom-up aspect: This concerns the direction from the right-hand side to the left-hand side of the 
recursion. 

In the context of complexity analysis, we express the number of elementary operations in terms of the 
number of recursive calls along relevant recursion paths or the number of steps in a bottom-up enumeration 
sequence, interchangeably. This is due to the fact that to every sequence of top-down recursive calls based 
on Equation [2] corresponds a sequence of bottom-up enumeration steps. 

Our plan in understanding the complexity issues is to start with the bottom-up aspect and eventually 
relate it to the top-down aspect. As such, we initially consider procedures based on the bottom-up style. As 
a starting point, consider the following approaches: 

• the recursion direction is fixed to be either leftmost or rightmost, 

• the recursion direction may vary between leftmost and rightmost. 

In either approach, we need an enumeration scheme which specifies the order of distance computations 
for the subproblems. 

Fixed-Direction Recursion: For recursion of fixed direction, a naive scheme is to arrange the subtree- 
subtree distance computations, as well as the relevant forest-forest distance computations, in one of two 
alternative ways as follows: 

• LR-postorder: The subtrees as well as the subforests contained in each subtree are enumerated in 
left-to-right postorder. 

• RL-postorder: The subtrees as well as the subforests contained in each subtree are enumerated in 
right-to-left postorder. 

The procedures for sorting the enumeration order for subforests are listed in Algorithms [1] and [2].


Algorithm 1: Construct an enumeration scheme for the subforests of a tree T based on LR-postorder.
input : T, with |T| = n
output: an enumeration sequence L of subforests of T based on the LR-postorder

label the nodes of T in LR-postorder ;
for i ← 1 to n do
    construct S_i to be a sequence of subforests of T[i] with the rightmost root enumerated in LR-postorder ;
L = S_1 ;
for i ← 2 to n do
    L = L ∘ S_i ;
output L ;
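A possible rendering of Algorithm 1 is sketched below. It assumes a tree is a pair `(label, children)` and encodes each relevant subforest of $T[i]$ as a pair `(leftmost_leaf, rightmost_root)` of LR-postorder numbers, meaning the forest made of all nodes numbered between the two; this interval encoding and the function names are assumptions made for illustration. The RL-postorder variant is obtained by mirroring the child order.

```python
def postorder_info(tree):
    """Return (labels, lml), both 1-based by LR-postorder id.

    lml[i] is the id of the leftmost leaf of the subtree T[i].
    """
    labels, lml = [None], [None]

    def visit(node):
        label, children = node
        first_leaf = None
        for child in children:
            leaf = visit(child)
            if first_leaf is None:
                first_leaf = leaf
        labels.append(label)
        node_id = len(labels) - 1
        lml.append(first_leaf if first_leaf is not None else node_id)
        return lml[node_id]

    visit(tree)
    return labels, lml


def lr_postorder_enumeration(tree):
    """Concatenate, for every subtree T[i], its subforests by rightmost root."""
    labels, lml = postorder_info(tree)
    n = len(labels) - 1
    L = []
    for i in range(1, n + 1):
        # S_i: forests of T[i] whose rightmost root runs over T[i] in postorder.
        S_i = [(lml[i], v) for v in range(lml[i], i + 1)]
        L.extend(S_i)
    return L
```

The length of the returned list is the sum of the subtree sizes, which is the quantity bounded in Lemma 1.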


Algorithm 2: Construct an enumeration scheme for the subforests of a tree T based on RL-postorder.
input : T, with |T| = n
output: an enumeration sequence L of subforests of T based on the RL-postorder

label the nodes of T in RL-postorder ;
for i ← 1 to n do
    construct S_i to be a sequence of subforests of T[i] with the leftmost root enumerated in RL-postorder ;
L = S_1 ;
for i ← 2 to n do
    L = L ∘ S_i ;
output L ;


A simple example of computing $d(T_1, T_2)$ is given in Figure [6], where the enumeration of nodes follows
the LR-postorder as described in Algorithm [1]. A position in a table corresponding to a pair of nodes
$(t_1[i], t_2[j])$ represents the distance between two relevant subforests with $t_1[i]$ and $t_2[j]$ being the rightmost
roots. Figures 6(a) and 6(b) show the computations for $d(T_1, T_2[j])$ with $t_2[j] \in \{d, e, f\}$ and $d(T_1[i], T_2)$
with $t_1[i] \in \{a, b\}$, respectively. The computation for $d(T_1, T_2)$ is shown in Figure 6(c), which makes use of
the distances computed in Figures 6(a) and 6(b). For example, consider the position corresponding to $(c, f)$
in Figure 6(c). This corresponds to $d(T_1, T_2 - g)$, where $T_2 - g$ is the forest obtained from $T_2$ by removing
the root $g$. By Equation [2] we have:

$$
d(T_1, T_2 - g) = \min
\begin{cases}
d(T_1 - c, T_2 - g) + \delta(c, \emptyset), \\
d(T_1, T_2 - g - f) + \delta(\emptyset, f), \\
d(\emptyset, T_2 - g - f) + d(T_1, f)
\end{cases}
$$

Denote by $D_i(x, y)$ with $i \in \{(a), (b), (c)\}$ the values in the tables in Figures 6(a), 6(b), and 6(c), respectively,
at the position corresponding to $x$ and $y$. Therefore,
$d(T_1, T_2 - g) = \min\{D_{(c)}(b, f) + \delta(c, \emptyset),\ D_{(c)}(c, e) + \delta(\emptyset, f),\ D_{(c)}(\emptyset, e) + D_{(a)}(c, f)\}$
$= \min\{4 + 2, 4 + 2, 4 + 5\} = \min\{6, 6, 9\} = 6$. Note that $d(T_1, f)$ in the last
term is computed in the table of Figure 6(a) at $D_{(a)}(c, f)$. As another example, $D_{(c)}(c, d)$ and $D_{(c)}(a, g)$
are computed at $D_{(a)}(c, d)$ and $D_{(b)}(a, g)$, respectively.

Lemma 1. The enumeration scheme based on the LR or RL-postorder takes $O(|T|^2)$ steps.

Proof. We consider only the LR case as RL is symmetrical. Each node $t_i$ within a subtree $T_k$ is the
rightmost root of exactly one relevant subforest in $T_k$. Denote by $s_i$ the number of subtrees
in which a node $t_i$ can be. Summing over all nodes, we have the total number of enumeration steps as
$\sum_{i=1}^{|T|} s_i \le \sum_{i=1}^{|T|} depth(t_i) \le \sum_{i=1}^{|T|} depth(T) \le |T|^2 = O(|T|^2)$. □


Variable-Direction Recursion: For recursion of variable direction, we enumerate the subforests in one 
of two alternative orders as follows: 

• Prefix-suffix postorder: For each node $t[i]$ enumerated in LR-postorder, we enumerate, in increasing size,
the relevant subforests with distinct leftmost roots which contain $t[i]$ as the rightmost root.

• Suffix-prefix postorder: For each node $t[i]$ enumerated in RL-postorder, we enumerate, in increasing size,
the relevant subforests with distinct rightmost roots which contain $t[i]$ as the leftmost root.

The order of enumeration must be such that for any subforest $F$, all the subforests contained in $F$ have
been enumerated before $F$ is enumerated. If we enumerate the subforests with the prefix-suffix postorder,
this is done as follows. Consider in general a forest in which $t_i$ and $t_j$ are the leftmost and rightmost roots,
respectively. The rightmost root is enumerated in a left-to-right postorder starting at the leftmost leaf.
For each $t_j$ thus enumerated, consider the largest forest with $t_j$ being the rightmost root. Now, to obtain
the order for the subforests contained in this forest with $t_j$ being the rightmost root, let $F_1, F_2, \ldots, F_k$ be
the sequence of subforests resulting from successively deleting the leftmost root from the forest until only
the rightmost subtree rooted at $t_j$ remains, i.e., $F_k = T[t_j]$. The order we want is the reverse sequence
$F_k, F_{k-1}, \ldots, F_1$. In this way, we obtain a sequence of subforests for each $t_j$. Concatenating all the sequences
in the increasing order of $t_j$, we have the final sequence of all the subforests of $T$ arranged in a proper
order. The alternative way of enumerating the subforests, namely the suffix-prefix postorder, is handled
symmetrically. The procedures are listed in Algorithms [3] and [4].

Figure 6 Tables for the computation of $d(T_1, T_2)$, with panels (a), (b), and (c). The basic edit costs are
defined as follows: $\delta(x, y) = 1$ if $x \ne y$, and 0 if $x = y$; $\delta(x, \emptyset) = \delta(\emptyset, x) = 2$. The optimal edit
scripts can be traced with the arrow sequences.


Algorithm 3: Construct an enumeration scheme for the subforests of a tree T based on prefix-suffix
postorder.

input : T, with |T| = n
output: an enumeration sequence L of subforests of T based on the prefix-suffix postorder

construct P to be a sequence of subforests of T resulting from successive deletion on the rightmost root ;   /* P[1] = T */
construct P' = (F_1, F_2, ..., F_n) to be the reverse sequence of P ;   /* F_n = T */
for i ← 1 to n do
    construct S_i to be a sequence of subforests of F_i ∈ P', all sharing the same rightmost root,
    resulting from successive deletion on the leftmost root ;
        /* S_i[1] = F_i, S_i[k] = S_i[k - 1] - lm_root(S_i[k - 1]), rm_root(S_i[k]) = rm_root(S_i[k - 1]), ∀k > 1 */
    construct S'_i to be the reverse sequence of S_i ;   /* S'_i[|S'_i|] = F_i */
L = S'_1 ;
for i ← 2 to n do   /* concatenate all sequences */
    L = L ∘ S'_i ;
output L ;
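Algorithm 3 can be sketched directly on explicit forests. The version below assumes a forest is a tuple of trees and a tree is a pair `(label, children)` with `children` a tuple; it stops deleting leftmost roots once a single tree remains, which is one reading of the requirement that the rightmost root stay fixed. The function names are illustrative.

```python
def delete_rightmost_root(forest):
    """Remove the rightmost root; its children take its place."""
    *rest, (_, children) = forest
    return tuple(rest) + tuple(children)


def delete_leftmost_root(forest):
    """Remove the leftmost root; its children take its place."""
    (_, children), *rest = forest
    return tuple(children) + tuple(rest)


def prefix_suffix_enumeration(tree):
    # P: forests obtained by successive deletion of the rightmost root.
    P = []
    forest = (tree,)
    while forest:
        P.append(forest)
        forest = delete_rightmost_root(forest)
    L = []
    for F in reversed(P):  # P' = (F_1, ..., F_n), smallest forest first
        S = [F]
        # Delete leftmost roots while the rightmost root is not the only root.
        while len(S[-1]) > 1:
            S.append(delete_leftmost_root(S[-1]))
        L.extend(reversed(S))  # S'_i: increasing size, ending with F_i
    return L
```

For the tree `c(a, b)` this yields the forests `(a)`, `(b)`, `(a, b)`, and `c(a, b)` in that order, so every forest appears after the smaller forests it contains. The suffix-prefix order follows by swapping the two deletion helpers.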


Algorithm 4: Construct an enumeration scheme for the subforests of a tree T based on suffix-prefix
postorder.

input : T, with |T| = n
output: an enumeration sequence L of subforests of T based on the suffix-prefix postorder

construct S to be a sequence of subforests of T resulting from successive deletion on the leftmost root ;   /* S[1] = T */
construct S' = (F_1, F_2, ..., F_n) to be the reverse sequence of S ;   /* F_n = T */
for i ← 1 to n do
    construct P_i to be a sequence of subforests of F_i ∈ S', all sharing the same leftmost root,
    resulting from successive deletion on the rightmost root ;
        /* P_i[1] = F_i, P_i[k] = P_i[k - 1] - rm_root(P_i[k - 1]), lm_root(P_i[k]) = lm_root(P_i[k - 1]), ∀k > 1 */
    construct P'_i to be the reverse sequence of P_i ;   /* P'_i[|P'_i|] = F_i */
L = P'_1 ;
for i ← 2 to n do   /* concatenate all sequences */
    L = L ∘ P'_i ;
output L ;


Examples of prefix-suffix and suffix-prefix postorder enumerations are given in Figure [7] and Figure [8],
respectively. In Figure [7], subforests having the same rightmost root are in contiguous boxes, whereas in
Figure [8], subforests having the same leftmost root are in contiguous boxes.

Figure 7 An example of enumerating subforests in prefix-suffix postorder.

Figure 8 An example of enumerating subforests in suffix-prefix postorder.

Lemma 2. The enumeration scheme based on prefix-suffix or suffix-prefix postorder takes $O(|T|^2)$ steps.

Proof. We consider only the prefix-suffix postorder as the suffix-prefix postorder is the symmetrical case.
Denote by $f_i$ the number of subforests with distinct leftmost roots which contain $t_i$ as the rightmost root.
Summing over all nodes, we have $\sum_{i=1}^{|T|} f_i \le \sum_{i=1}^{|T|} |T| = O(|T|^2)$. □

An algorithm for computing tree edit distances where the relevant subforests are enumerated by the
above procedures is given in Algorithm [5]. The algorithm can be implemented using $O(|T_1| \times |T_2|)$ space if
the forest distances are allowed to be overwritten.


Algorithm 5: Compute tree edit distance in $O(m^2 n^2)$ time.
input : (T_1, T_2), with |T_1| = m and |T_2| = n
output: d(T_1[i], T_2[j]) for 1 ≤ i ≤ m and 1 ≤ j ≤ n

sort relevant subforests of (T_1, T_2) into (L_1, L_2) as in Algorithms [1], [2], [3], or [4] ;
for i ← 1 to |L_1| do
    for j ← 1 to |L_2| do
        compute d(L_1[i], L_2[j]) as in Equation [2] ;


Theorem 1. The tree edit distance as computed in Algorithm [5] takes $O(m^2 n^2)$ time, where $m = |T_1|$ and
$n = |T_2|$.

Proof. The result follows directly from Lemmas [1] and [2]. □

The algorithms presented in this section follow a bottom-up dynamic programming style where the tree
nodes are numbered in postorder, in contrast to the preorder numbering of nodes in Tai’s algorithm [12].
Tai’s algorithm progressively increases the sizes of the trees, one node at a time
following the preorder numbers, and computes the distance for each such pair of partial trees.¹


3 Improved Algorithmic Strategies 

The algorithm presented in the previous section is based on the principle of dynamic programming which 
relies on a well-defined scheme for enumerating the relevant subforests. In this approach, forest distances are 
arranged in a certain order so as to facilitate the relay of distance computations. Essentially, we take
advantage of the overlap among subforests that are contained in the same subtree. To make further improvement,
we look for ways to take advantage of the overlap among subtrees as well.

3.1 Leftmost Paths 

We examine recursion of fixed direction, say rightmost recursion, the situation for leftmost recursion being
symmetrical. This means that the enumeration will be in LR-postorder. Consider a path $(t_1, t_2, \ldots, t_k)$
where $t_i$ is the leftmost child of $t_{i+1}$ for $1 \le i \le k - 1$. Let $(T_1, T_2, \ldots, T_k)$ be the sequence of subtrees where
$t_i$ is the root of $T_i$, and $(F_1, F_2, \ldots, F_k)$ be the sequence of sets where $F_i$ denotes the set of subforests of $T_i$
all containing the leftmost leaf of $T_i$. We have $F_1 \subset F_2 \subset \cdots \subset F_k$. This means that enumerating $F_k$ once
effectively takes care of the enumerations for $F_1, F_2, \ldots, F_{k-1}$. Generalizing this situation to the whole tree,
we see that all subtrees sharing the same leftmost leaf can be handled together. Carried out in this way,
a tree is recursively decomposed into disjoint leftmost paths, where each such leftmost path is shared by a
set of subtrees which can be handled together along this path with the LR-postorder enumeration, thereby
removing the repetitions. This strategy was developed by Zhang and Shasha [13]. An example of such a path
decomposition is given in Figure 9(a).

Figure 9 Leftmost paths and rightmost paths (in thick edges): (a) leftmost paths, (b) rightmost paths.

¹ In fact, it does not compute the true distance since it only considers the optimal mappings along a pair of paths for each pair
of partial trees, instead of the entire partial trees. If Algorithm [5] is applied for each pair of partial trees, the time complexity
is easily seen as $O(m^3 n^3)$.

Each leftmost path corresponds to the smallest subtree that contains this path, and the root of this 
subtree is referred to as an “LR-keyroot”, which is defined as follows. 

Definition 4 (LR-keyroots). An LR-keyroot is a node that is either the root of $T$ or has a left sibling.
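Definition 4 translates into a single postorder traversal. The sketch below assumes a tree is a pair `(label, children)` and reports the LR-keyroots by their 1-based LR-postorder numbers; the function name is illustrative.

```python
def lr_keyroots(tree):
    """Return the LR-postorder ids of the LR-keyroots, in increasing order."""
    keyroots = []
    counter = [0]  # running LR-postorder number

    def visit(node, is_keyroot):
        _, children = node
        for k, child in enumerate(children):
            # A child is an LR-keyroot exactly when it has a left sibling.
            visit(child, k > 0)
        counter[0] += 1
        if is_keyroot:
            keyroots.append(counter[0])

    visit(tree, True)  # the root is always an LR-keyroot
    return keyroots
```

Because every leftmost path ends at a distinct leaf and starts at a distinct LR-keyroot, the number of LR-keyroots equals the number of leaves.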

The new enumeration scheme works as follows. We identify all the LR-keyroots in the tree, and sort 
them in increasing order by their LR-postorder numbers, referred to as “LR-keyroot postorder”. This will 
be the order by which the subforests are enumerated, i.e., based on the LR-keyroots with which they are 
associated. The procedure is listed in Algorithm [6].


Algorithm 6: Construct the enumeration scheme for the subforests of a tree T in LR-keyroot postorder.
The RL-based procedure is symmetrical to this.
input : T, with |T| = n
output: an enumeration sequence L of subforests of T in the LR-keyroot postorder

identify the LR-keyroots of T ;
sort the LR-keyroots in increasing order of LR-postorder numbers into a list K = {k_1, k_2, ..., k_l} ;
for i ← 1 to l do
    construct S_i to be a sequence of subforests of T[k_i] with the rightmost root enumerated in LR-postorder ;
L = S_1 ;
for i ← 2 to l do
    L = L ∘ S_i ;
output L ;


This enumeration scheme gives rise to Algorithm [7].


Algorithm 7: Compute tree edit distance in $O(mn \prod_{i=1}^{2} \min\{depth(T_i), \#leaves(T_i)\})$ time.

input : (T_1, T_2), with |T_1| = m and |T_2| = n
output: d(T_1[i], T_2[j]) for 1 ≤ i ≤ m and 1 ≤ j ≤ n

sort relevant subforests of (T_1, T_2) into (L_1, L_2) as in Algorithm [6] ;
for i ← 1 to |L_1| do
    for j ← 1 to |L_2| do
        compute d(L_1[i], L_2[j]) as in Equation [2] ;
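One common way to realize the keyroot-based scheme is sketched below. It assumes a tree is a pair `(label, children)`, unit insertion and deletion costs by default, and a permanent table `td` for subtree-subtree distances alongside a temporary table `fd` for forest distances; the names `annotate` and `tree_edit_distance` are illustrative, and the code is a sketch of the strategy rather than a transcription of Algorithm 7.

```python
def annotate(tree):
    """Return LR-postorder labels, leftmost-leaf ids and LR-keyroots (1-based)."""
    labels, lml, keyroots = [None], [None], []

    def visit(node, is_keyroot):
        label, children = node
        first_leaf = None
        for k, child in enumerate(children):
            leaf = visit(child, k > 0)
            if first_leaf is None:
                first_leaf = leaf
        labels.append(label)
        node_id = len(labels) - 1
        lml.append(first_leaf if first_leaf is not None else node_id)
        if is_keyroot:
            keyroots.append(node_id)
        return lml[node_id]

    visit(tree, True)
    return labels, lml, keyroots


def tree_edit_distance(tree1, tree2, sub=lambda a, b: 0 if a == b else 1, indel=1):
    lab1, l1, keys1 = annotate(tree1)
    lab2, l2, keys2 = annotate(tree2)
    m, n = len(lab1) - 1, len(lab2) - 1
    td = [[0] * (n + 1) for _ in range(m + 1)]  # subtree-subtree distances
    for i in keys1:          # keyroots in increasing LR-postorder
        for j in keys2:
            li, lj = l1[i], l2[j]
            rows, cols = i - li + 2, j - lj + 2
            # fd[x][y]: distance between the first x nodes of T1[i] and the
            # first y nodes of T2[j], both counted in LR-postorder.
            fd = [[0] * cols for _ in range(rows)]
            for x in range(1, rows):
                fd[x][0] = fd[x - 1][0] + indel
            for y in range(1, cols):
                fd[0][y] = fd[0][y - 1] + indel
            for x in range(1, rows):
                for y in range(1, cols):
                    a, b = li + x - 1, lj + y - 1  # rightmost roots
                    delete = fd[x - 1][y] + indel
                    insert = fd[x][y - 1] + indel
                    if l1[a] == li and l2[b] == lj:
                        # Both forests are whole subtrees on the leftmost paths.
                        fd[x][y] = min(delete, insert,
                                       fd[x - 1][y - 1] + sub(lab1[a], lab2[b]))
                        td[a][b] = fd[x][y]
                    else:
                        # Split off the rightmost subtrees, already stored in td.
                        p, q = l1[a] - li, l2[b] - lj
                        fd[x][y] = min(delete, insert, fd[p][q] + td[a][b])
    return td[m][n]
```

The lookup `td[a][b]` in the last branch is safe because the keyroot owning `a` or `b` has a smaller postorder number and was therefore processed in an earlier iteration.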
Theorem 2. The algorithm computes $d(T_1[i], T_2[j])$ for all $1 \le i \le |T_1|$ and $1 \le j \le |T_2|$.

Proof. We prove it by induction on the sizes of the subtrees induced by the keyroots.

Base case: This involves only the singleton subtrees. Since all the basic edit costs with respect to single
nodes are already defined, the base case holds.

Induction hypothesis: For any $(i, j) \in \{(i, j) \mid i \in \text{LR-keyroots}(T_1),\ j \in \text{LR-keyroots}(T_2)\}$, just before the
computation of $d(T_1[i], T_2[j])$, the following set of distances has been computed, $D = D_1 \cup D_2$ where

• $D_1 = \{d(T_1[i'], T_2[j']) \mid i' \in T_1[i] - \text{leftmost-path}(T_1[i]),\ j' \in T_2[j]\}$,

• $D_2 = \{d(T_1[i'], T_2[j']) \mid i' \in T_1[i],\ j' \in T_2[j] - \text{leftmost-path}(T_2[j])\}$.

Induction step: We show that $\{d(T_1[i'], T_2[j']) \mid i' \in T_1[i],\ j' \in T_2[j]\}$ are all computed. The
subtree-subtree distances to be computed in the process of computing $d(T_1[i], T_2[j])$ are
$\{d(T_1[i'], T_2[j']) \mid i' \in \text{leftmost-path}(T_1[i]),\ j' \in \text{leftmost-path}(T_2[j])\}$. The induction step holds since it is in accord with the
LR-keyroot postorder that the algorithm follows, which means that all distances specified in the induction
hypothesis have been computed. This concludes the proof. □

To see the impact of the leftmost-path decomposition scheme on the time complexity, it is necessary to 
introduce the concept of “LR-collapsed depth” defined as follows. 

Definition 5 (LR-Collapsed Depth). The LR-collapsed depth of a node $t_i$ is the number of its
ancestors that are LR-keyroots. The LR-collapsed depth of a tree $T$ is defined as
$\text{LR-collapsed-depth}(T) = \max\{\text{LR-collapsed-depth}(t_i) \mid t_i \in T\}$.

Intuitively, the LR-collapsed depth of a tree $T$ represents the maximal number of non-leaf LR-keyroots
that a path in $T$ may contain. We define LR-collapsed depth as a way to estimate the maximum number of times a node,
representing the rightmost root of some relevant subforest, is enumerated with the LR-keyroot postorder.
As a consequence of this enumeration scheme, repetitious enumerations involving a given node are removed,
since subtrees containing this node and having the same leftmost leaf are no longer handled separately.

Lemma 3. $\text{LR-collapsed-depth}(T) \le \min\{depth(T), \#leaves(T)\}$.

Proof. Since the number of LR-keyroots on any path is bounded by the depth of the path, we have
$\text{LR-collapsed-depth}(T) \le depth(T)$. For any two LR-keyroots $k_i$ and $k_j$, the subtrees $T_i$ and $T_j$ rooted
at $k_i$ and $k_j$ have distinct leftmost leaves. This means that the number of subtrees in $T$ that are rooted
at LR-keyroots cannot exceed the number of leaves, i.e., $\#\text{LR-keyroots}(T) \le \#leaves(T)$. Since the
number of LR-keyroots on any path is no more than the total number of LR-keyroots in the tree, i.e.,
$\text{LR-collapsed-depth}(T) \le \#\text{LR-keyroots}(T)$, we have $\text{LR-collapsed-depth}(T) \le \#leaves(T)$. Therefore,
$\text{LR-collapsed-depth}(T)$ can be bounded by $depth(T)$ or $\#leaves(T)$, whichever is smaller. This concludes
the proof. □
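The bound in Lemma 3 can be checked numerically on small trees. The sketch below makes two assumptions that the text leaves open: ancestors are counted as proper ancestors, and depth is measured in edges. The tree format `(label, children)` and the function name are illustrative.

```python
def shape_statistics(tree):
    """Return (lr_collapsed_depth, depth, leaves) for a (label, children) tree."""
    stats = {"collapsed": 0, "depth": 0, "leaves": 0}

    def visit(node, depth, keyroot_ancestors, is_keyroot):
        _, children = node
        stats["collapsed"] = max(stats["collapsed"], keyroot_ancestors)
        stats["depth"] = max(stats["depth"], depth)
        if not children:
            stats["leaves"] += 1
        # Descendants see this node as an ancestor; count it if it is a keyroot.
        below = keyroot_ancestors + (1 if is_keyroot else 0)
        for k, child in enumerate(children):
            visit(child, depth + 1, below, k > 0)

    visit(tree, 0, 0, True)
    return stats["collapsed"], stats["depth"], stats["leaves"]
```

On a chain, where every node has a single child, the collapsed depth is 1 regardless of the height, which is the situation in which the leftmost-path strategy helps most.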

Here is the implication of Lemma [3]. In the previous procedure, a node in $T$ may be enumerated
$depth(T)$ times with the LR-postorder enumeration scheme, because the maximal number of subtrees in
which a node may be contained is $depth(T)$. Grouping together subtrees with the same leftmost leaf
removes the repetitions, and the improvement is evident since the upper bound is reduced from $depth(T)$ to
$\min\{depth(T), \#leaves(T)\}$.

Theorem 3. The tree edit distance problem can be solved in $O(mn \prod_{i=1}^{2} \min\{depth(T_i), \#leaves(T_i)\})$ time,
where $m = |T_1|$ and $n = |T_2|$.

Proof. From Lemma [3], each node in $T$, representing the rightmost root of some relevant subforest, is
enumerated at most $\text{LR-collapsed-depth}(T)$ times using the enumeration scheme in Algorithm [6]. Hence, the
result follows directly. □

Theorem 4. The tree edit distance problem can be solved in $O(mn)$ space, where $m = |T_1|$ and $n = |T_2|$.

Proof. The computation uses two $m \times n$ tables $D_t$ and $D_f$. The forest-forest distances are computed in $D_f$,
where the values can be overwritten when the computation moves from one pair of subtrees to another pair.
The subtree-subtree distances obtained in the process of computing the forest-forest distances are stored in
$D_t$, and fetched for use in computing forest-forest distances. □

In this section, a new way is presented for enumerating the relevant subforests in LR-postorder where 
repetitious steps associated with the leftmost paths in a tree are eliminated, resulting in an improved time 
complexity. However, for some tree shapes the leftmost-path decomposition yields only marginal
benefits in running time. This leads to the strategy to be presented
in the next section. 

3.2 Heavy Paths on One Tree 

We see from the previous section that the computation time is due to the enumeration of subforests, where
each enumeration step takes constant time to perform a few simple arithmetic operations. The leftmost-path
strategy improves the time complexity by enumerating subtrees with overlapping leftmost paths together in
the same sequence of computation. Since the running time depends on the shapes of the trees, it is
worthwhile to consider a different type of path decomposition that can also offer benefits with respect to
the complexity. Klein [7] explored this possibility and developed a new decomposition strategy based on a
type of path referred to as a “heavy path”. In contrast to the Zhang-Shasha strategy, which may be
seen as a way of improving upon the naive fixed-direction procedure based on the LR-postorder enumeration
scheme given in Section [2], the new strategy may be seen as a way of improving upon the variable-direction
procedure based on the prefix-suffix or suffix-prefix postorder enumeration scheme. We give a few definitions
related to the idea behind the heavy path.

Definition 6 (Heavy Child/Node). For any node $t$ in $T$, the child $t_h$ which is the root of the largest subtree
(breaking ties arbitrarily) among the sibling subtrees is the heavy child of $t$. We use the terms “heavy child”
and “heavy node” interchangeably.

The definition of heavy path is given as follows.

Definition 7 (Heavy Path). [11] The heavy path of a tree $T$ is a unique path connecting the root and a
leaf of $T$ on which every node, except the root, is a heavy node.

Figure [10] shows an example of a tree recursively decomposed into a set of heavy paths.


Figure 10 Heavy paths (in thick edges).
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Similar to LR and RL-postorder which are defined with respect to the leftmost path and rightmost path, 
respectively, we define an enumeration scheme with respect to the heavy path as follows. 

Definition 8 (H-Postorder). The nodes of a tree $T$ are enumerated in H-postorder as follows. Start at the leaf 
$t_i$ on heavy-path($T$); enumerate the subtrees rooted at its right siblings, if any, in LR-postorder, then the 
subtrees rooted at its left siblings, if any, in RL-postorder. Repeat the same process with each next higher 
node on heavy-path($T$) until reaching root($T$). 

If we ignore what happens on the left side of the heavy path during an H-postorder enumeration, we see a 
sequence of enumeration steps identical to an LR-postorder enumeration. If we ignore what happens on the 
right side of the heavy path, we see a sequence of enumeration steps identical to an RL-postorder 
enumeration. A second version symmetrical to this one, i.e., RL then LR alternately, also works. In the 
following presentation, the version in Definition 8 is used. An example of enumerating subforests in 
H-postorder is given in Figure 11. 


Figure 11 An example of enumerating subforests in H-postorder. 
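
The enumeration of Definition 8 can be sketched as follows. This is one reading of the definition, not 
code from the reviewed papers: each node on the heavy path is listed first, followed by the subtrees of its 
right siblings (nearest first, in LR-postorder) and then those of its left siblings (nearest first, in 
RL-postorder). The tree representation, function names, and example tree are assumptions made for the 
illustration. 

```python
def subtree_sizes(children, root):
    """Return a dict mapping each node to the size of the subtree rooted at it."""
    size = {}

    def visit(v):
        size[v] = 1 + sum(visit(c) for c in children.get(v, []))
        return size[v]

    visit(root)
    return size


def lr_postorder(children, v):
    """Children from left to right, then the node itself."""
    order = []
    for c in children.get(v, []):
        order.extend(lr_postorder(children, c))
    return order + [v]


def rl_postorder(children, v):
    """Children from right to left, then the node itself."""
    order = []
    for c in reversed(children.get(v, [])):
        order.extend(rl_postorder(children, c))
    return order + [v]


def h_postorder(children, root):
    """Nodes of the tree in H-postorder (Definition 8)."""
    size = subtree_sizes(children, root)
    # Heavy path from the root down to a leaf; ties go to the leftmost child.
    path = [root]
    while children.get(path[-1]):
        path.append(max(children[path[-1]], key=lambda c: size[c]))
    order = [path[-1]]  # start at the leaf on the heavy path
    for parent, heavy in zip(reversed(path[:-1]), reversed(path[1:])):
        kids = children[parent]
        k = kids.index(heavy)
        for c in kids[k + 1:]:            # right siblings, nearest first
            order.extend(lr_postorder(children, c))
        for c in reversed(kids[:k]):      # left siblings, nearest first
            order.extend(rl_postorder(children, c))
        order.append(parent)              # next higher node on the heavy path
    return order


# Hypothetical example: the heavy path is a, c, e, g.
tree = {"a": ["b", "c", "d"], "b": ["x", "y"], "c": ["e", "f"], "e": ["g"]}
order = h_postorder(tree, "a")
print(order)  # ['g', 'e', 'f', 'c', 'd', 'y', 'x', 'b', 'a']
```

Dropping the nodes to the left of the heavy path (x, y, b) leaves g, e, f, c, d, a, which is the 
LR-postorder of the remaining tree; dropping the nodes to the right (f, d) leaves g, e, c, y, x, b, a, 
which is its RL-postorder. Each prefix of the sequence corresponds to one subforest in the enumeration. 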



Analogous to LR-keyroots, a type of keyroots specific to this context is defined as follows. 

Definition 9 (H-keyroots). An H-keyroot is either the root of T or the root of a subtree in T that has a 
larger sibling subtree. If multiple subtrees are equally the largest among their sibling subtrees, all but one 
(chosen arbitrarily) are H-keyroots. 

Definitions 6 and 9 are equivalent since, for any node, once its heavy child is specified, the other children 
are H-keyroots, and vice versa. A node in a tree is either a heavy node or an H-keyroot. 

The algorithm works as follows. The H-keyroots of the larger tree are sorted into a list $K$ in increasing 
order of H-postorder numbers. For each subtree whose root is in $K$, the relevant subforests are ordered in 
H-postorder, and the ordered sequences are concatenated to form the entire sequence, as listed in 
Algorithm 8; we call this sequence the “H-keyroot postorder”. On the smaller tree, all subforests are ordered 
in prefix-suffix or suffix-prefix postorder, as in Algorithms 3 or 4. The new algorithm is listed in 
Algorithm 9. 


Algorithm 8: Construct the enumeration scheme for the subforests of a tree $T$ in H-keyroot postorder. 

input : $T$, with $|T| = n$ 
output: an enumeration sequence $L$ of subforests of $T$ in the H-keyroot postorder 

identify the H-keyroots of $T$; 
sort the H-keyroots in increasing order of H-postorder numbers into a list $K = \{k_1, k_2, \dots, k_l\}$; 
for $i \leftarrow 1$ to $l$ do 
    construct $S_i$ to be the sequence of subforests of $T[k_i]$ enumerated in H-postorder; 
$L = S_1$; 
for $i \leftarrow 2$ to $l$ do 
    $L = L \circ S_i$; 
output $L$; 
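
A minimal Python sketch of Algorithm 8 is given below, under assumptions that are not part of the original 
listing: the tree is a dictionary mapping each node to its ordered list of children (containing only nodes 
reachable from the root), each subforest is represented by the tuple of nodes enumerated so far (a prefix 
of the H-postorder of the keyroot's subtree), and all function names are hypothetical. 

```python
def subtree_sizes(children, root):
    """Return a dict mapping each node to the size of the subtree rooted at it."""
    size = {}

    def visit(v):
        size[v] = 1 + sum(visit(c) for c in children.get(v, []))
        return size[v]

    visit(root)
    return size


def postorder(children, v, left_to_right=True):
    """LR-postorder of the subtree at v, or RL-postorder if left_to_right is False."""
    kids = children.get(v, [])
    order = []
    for c in (kids if left_to_right else reversed(kids)):
        order.extend(postorder(children, c, left_to_right))
    return order + [v]


def h_postorder(children, size, root):
    """Nodes of the subtree at root in H-postorder (one reading of Definition 8)."""
    path = [root]
    while children.get(path[-1]):
        path.append(max(children[path[-1]], key=lambda c: size[c]))
    order = [path[-1]]
    for parent, heavy in zip(reversed(path[:-1]), reversed(path[1:])):
        kids = children[parent]
        k = kids.index(heavy)
        for c in kids[k + 1:]:
            order.extend(postorder(children, c, True))
        for c in reversed(kids[:k]):
            order.extend(postorder(children, c, False))
        order.append(parent)
    return order


def h_keyroot_postorder(children, root):
    """Return the sorted H-keyroots and the concatenated subforest sequence L."""
    size = subtree_sizes(children, root)
    # Identify the H-keyroots: the root plus every child that is not a heavy child.
    keyroots = [root]
    for kids in children.values():
        if kids:
            heavy = max(kids, key=lambda c: size[c])
            keyroots.extend(c for c in kids if c != heavy)
    # Sort the H-keyroots by their H-postorder numbers in the whole tree.
    number = {v: i for i, v in enumerate(h_postorder(children, size, root))}
    keyroots.sort(key=lambda k: number[k])
    # Build S_i for each keyroot and concatenate the sequences into L.
    sequence = []
    for k in keyroots:
        nodes = h_postorder(children, size, k)
        sequence.extend(tuple(nodes[:j]) for j in range(1, len(nodes) + 1))
    return keyroots, sequence


tree = {"a": ["b", "c", "d"], "b": ["x", "y"], "c": ["e", "f"], "e": ["g"]}
keyroots, L = h_keyroot_postorder(tree, "a")
print(keyroots)  # ['f', 'd', 'y', 'b', 'a']
print(len(L))    # 15
```

In this representation the length of the sequence equals the sum of the sizes of the subtrees rooted at the 
H-keyroots (here 1 + 1 + 1 + 3 + 9 = 15), and a keyroot always precedes every keyroot that is its ancestor. 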


Theorem 5. The algorithm computes $d(T_1[i], T_2[j])$ for all $1 \le i \le |T_1|$ and $1 \le j \le |T_2|$. 

Proof. We prove it by induction on the sizes of the subtrees induced by the keyroots. 

Base case: This involves only the singleton subtrees. Since all the basic edit costs with respect to single 
nodes are already defined, the base case holds. 

Induction hypothesis: For any $k \in \text{H-keyroots}(T_2)$, just before the computation of $d(T_1, T_2[k])$, 
$\{d(T_1[i], T_2[j]) \mid i \in T_1,\ j \in T_2[k] - \text{heavy-path}(T_2[k])\}$ have been computed. 

Algorithm 9: Compute tree edit distance in $O(m^2 n \log n)$ time. 

input : $(T_1, T_2)$, with $|T_1| = m$, $|T_2| = n$, and $m \le n$ 
output: $d(T_1[i], T_2[j])$ for $1 \le i \le m$ and $1 \le j \le n$ 

sort relevant subforests of $T_1$ into $L_1$ as in Algorithms 3 or 4 and of $T_2$ into $L_2$ as in Algorithm 8; 
for $i \leftarrow 1$ to $|L_1|$ do 
    for $j \leftarrow 1$ to $|L_2|$ do 
        compute $d(L_1[i], L_2[j])$ as in Equation 2; 



Induction step: We show that $\{d(T_1[i], T_2[j]) \mid i \in T_1,\ j \in T_2[k]\}$ are all computed. The 
subtree-subtree distances to be computed in the process of computing $d(T_1, T_2[k])$ are 
$\{d(T_1[i], T_2[j]) \mid i \in T_1,\ j \in \text{heavy-path}(T_2[k])\}$. The induction step holds since it is in 
accord with the postorder that the algorithm follows, which means that all distances specified in the 
induction hypothesis have been computed. This concludes the proof. □ 

We consider some aspects of the time complexity for this algorithm as follows. 

Lemma 4. Let $h_1, h_2, \dots, h_k$ be any sequence of H-keyroots that are on the same path, where $h_i$ is an 
ancestor of $h_j$ if $i < j$. Then $|T[h_j]| \le |T[h_i]|/2$ if $j = i + 1$. 

Proof. Suppose that $|T[h_j]| > |T[h_i]|/2$. There are two cases to consider. 

1. The nodes $h_i$ and $h_j$ are consecutive nodes on the path. 

2. The nodes $h_i$ and $h_j$ are not consecutive nodes on the path. 

In case 1, $h_i$ is the parent of $h_j$. If $|T[h_j]| > |T[h_i]|/2$, then $h_j$ is the heavy child of $h_i$, which 
contradicts the fact that $h_j$ is an H-keyroot. In case 2, there exists a node $t$ on the path that is a 
descendant of $h_i$ as well as the parent of $h_j$. Since $|T[h_j]| > |T[h_i]|/2$ and $|T[h_i]| > |T[t]|$, we have 
$|T[h_j]| > |T[t]|/2$. This means that $h_j$ is the heavy child of $t$, contradicting the fact that $h_j$ is an 
H-keyroot. This concludes the proof. □ 

Analogous to LR-collapsed depth, a new version of collapsed depth based on H-keyroots is defined as 
follows. 

Definition 10 (H-Collapsed Depth). The H-collapsed depth of a node $t_i$ is the number of its ancestors 
that are H-keyroots. The H-collapsed depth of a tree $T$ is defined as 
H-collapsed-depth($T$) = $\max\{\text{H-collapsed-depth}(t_i) \mid t_i \in T\}$. 

Lemma 5. H-collapsed-depth($T$) $\le \log_2 |T|$. 

Proof. Consider a path $P$ in $T$ and the H-keyroots $h_0, h_1, h_2, \dots, h_k$ on $P$, with $h_0$ being the root 
of $T$. From Lemma 4, each H-keyroot $h_i$ on $P$ is the root of a subtree whose size is no larger than half the 
size of the subtree rooted at $h_{i-1}$. Starting at $h_0$, traverse down the path $P$. For each subsequent 
H-keyroot that is visited, the corresponding subtree size is reduced by at least a factor of 2 with respect 
to the nearest H-keyroot previously visited. It takes at most $\log_2 |T|$ encounters of H-keyroots for the 
subtree size to be reduced to 1, which is also the maximal number of H-keyroots a node may have as its 
ancestors. This concludes the proof. □ 
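
The bound of Lemma 5 can be checked numerically on small trees. The sketch below is only an illustration 
under stated assumptions: it reads “ancestors” in Definition 10 as proper ancestors, represents a tree as a 
dictionary from nodes to ordered child lists, and uses a hypothetical random tree generator; none of these 
names come from the reviewed papers. 

```python
import random


def random_tree(n, seed):
    """Random ordered tree on nodes 0..n-1 rooted at 0 (hypothetical generator)."""
    rng = random.Random(seed)
    children = {v: [] for v in range(n)}
    for v in range(1, n):
        children[rng.randrange(v)].append(v)  # attach v to an earlier node
    return children


def h_collapsed_depth(children, root=0):
    """H-collapsed depth of the tree, counting proper ancestors that are H-keyroots."""
    # Breadth-first order, so that every parent precedes its children.
    order, i = [root], 0
    while i < len(order):
        order.extend(children.get(order[i], []))
        i += 1
    size = {}
    for v in reversed(order):
        size[v] = 1 + sum(size[c] for c in children.get(v, []))
    # H-keyroots: the root and every child that is not the heavy child of its parent.
    keyroots = {root}
    for v in order:
        kids = children.get(v, [])
        if kids:
            heavy = max(kids, key=lambda c: size[c])
            keyroots.update(c for c in kids if c != heavy)
    depth = {root: 0}
    for v in order:
        for c in children.get(v, []):
            depth[c] = depth[v] + (1 if v in keyroots else 0)
    return max(depth.values())


# Complete binary tree on 7 nodes: the H-keyroots are 0, 2, 4, and 6.
complete = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
print(h_collapsed_depth(complete))  # 2

# The bound depth <= log2(n) is checked in integer form as 2**depth <= n.
for n in (1, 2, 3, 10, 64, 200):
    for seed in range(20):
        assert 2 ** h_collapsed_depth(random_tree(n, seed)) <= n
```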

In contrast to LR-collapsed depth, H-collapsed depth has an improved upper bound on the number of 
times that a node in the larger tree may be enumerated, which is related to how many separate distance 
computations, as identified by distinct keyroots, in which a node may participate. The bound, on the other 
hand, for a node in the smaller tree to be enumerated is the size of the tree, since all the subforests are 
considered. The overall impact on the time complexity is given in the next theorem. 

Theorem 6. The tree edit distance problem can be solved in $O(m^2 n \log n)$ time where $|T_1| = m$, 
$|T_2| = n$, and $m \le n$. 

Proof. For any $(i, j)$ with $i \in T_1$ and $j \in T_2$, the number of times $i$ is enumerated equals the number 
of subforests with distinct leftmost roots that contain $i$ as the rightmost root or, alternatively, the 
number of subforests with distinct rightmost roots that contain $i$ as the leftmost root. This is bounded by 
the size of $T_1$, i.e., $m$. On the other hand, $j$ is enumerated at most $1 + \log_2 n$ times according to 
Lemma 5, since this is the upper bound on the number of subtrees in $T_2$ rooted at distinct H-keyroots that 
contain $j$, and in each one $j$ is enumerated once. Summing over the $mn$ pairs gives at most 
$mn \cdot m (1 + \log_2 n) = O(m^2 n \log n)$ steps. The result thus follows. □ 

Theorem 7. The new algorithm solves the tree edit distance problem in $O(mn)$ space where $|T_1| = m$, 
$|T_2| = n$, and $m \le n$. 

Proof. We use a $2 \times m^2$ table where the $m^2$ subforests in $T_1$ are arranged in prefix-suffix or 
suffix-prefix order. For $T_2$, the idea is essentially that of a linear-space algorithm, by which distances 
for only one subforest are computed and updated when moving to the next subforest in the enumeration 
sequence. The subtree-subtree distances are stored in an $m \times n$ table. Since $m \le n$, the total space 
is $O(m^2 + mn) = O(mn)$. □ 

In the next section, we see how this algorithm is improved by a strategy that finds a way to apply 
heavy-path decompositions on both trees. 

3.3 Heavy Paths on Both Trees 

The algorithm by Klein reduces the upper bound on the number of separate distance computations required 
from $O(\min\{\text{depth}(T), \#\text{leaves}(T)\})$ to $O(\log |T|)$ for one tree. This is done at the cost of 
having to consider all the subforests in the other tree. Demaine et al. [3] improved this strategy by 
applying decompositions on both trees. By their algorithm, $d(T_1, T_2)$ is computed as follows, assuming 
that $|T_1| \le |T_2|$: 

1. If $|T_1| > |T_2|$, compute $d(T_2, T_1)$. 

2. Recursively, compute $d(T_1, T_2[k])$ with $k$ ranging over the set of nodes connecting directly to 
heavy-path($T_2$) with single edges. 

3. Compute $d(T_1, T_2)$ by enumerating relevant subforests of $T_1$ in prefix-suffix (Algorithm 3) or 
suffix-prefix postorder (Algorithm 4), and relevant subforests of $T_2$ in H-postorder (Definition 8). 

This is a combined recursive and bottom-up procedure in which the order of subtree-subtree pairs is 
arranged recursively in step 2, whereas the forest-forest distances encountered in a subtree-subtree 
distance computation, in step 3, are computed with bottom-up enumerations. In comparison, the algorithm by 
Klein consists of only steps 2 and 3, without step 1. Due to step 1, decomposition is done on both trees. 
Here, step 3 differs from the procedure in [3], where the computation is done with recursion. Nonetheless, 
they are equivalent since the precondition, that the subtree-subtree distances related to step 3 have been 
obtained, is the same. These distances are $d(T_1[i], T_2[j])$ for all $i \in T_1$ and 
$j \in T_2 - \text{heavy-path}(T_2)$. The subtree-subtree distances obtained in step 3 alone are 
$d(T_1[i], T_2[j])$ for all $i \in T_1$ and $j \in \text{heavy-path}(T_2)$. Therefore, the postcondition of 
step 3 is that $d(T_1[i], T_2[j])$ for all $i \in T_1$ and $j \in T_2$ have been obtained. To adapt the procedure 
into a bottom-up dynamic programming algorithm, the computation sequence can be obtained in advance by 
running the recursion of step 2 and only recording the subtree pair in step 3, without actually computing 
the distance. This yields the bottom-up computation sequence. 
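
The idea of obtaining the computation sequence in advance can be sketched as follows. This is an 
illustration under stated assumptions rather than the procedure of [3]: the `Tree` class, `make_tree`, and 
`schedule` are hypothetical names, trees are dictionaries from nodes to ordered child lists, ties between 
equally large siblings go to the leftmost child, and the order in which the subtrees hanging off a heavy 
path are visited is an arbitrary choice. 

```python
from dataclasses import dataclass


@dataclass
class Tree:
    name: str        # "T1" or "T2"; used only to report pairs in a fixed order
    children: dict   # node -> ordered list of children
    size: dict       # node -> size of the subtree rooted at that node


def make_tree(name, children, root):
    """Bundle a child map with its subtree sizes."""
    size = {}

    def visit(v):
        size[v] = 1 + sum(visit(c) for c in children.get(v, []))
        return size[v]

    visit(root)
    return Tree(name, children, size)


def schedule(a, a_root, b, b_root, out):
    """Append to `out` the subtree pairs in an order compatible with the recursion."""
    # Step 1: make (b, b_root) the larger subtree, swapping the roles if necessary.
    if a.size[a_root] > b.size[b_root]:
        a, a_root, b, b_root = b, b_root, a, a_root
    # Step 2: recurse on every subtree hanging off the heavy path of the larger subtree.
    v = b_root
    while b.children.get(v):
        kids = b.children[v]
        heavy = max(kids, key=lambda c: b.size[c])
        for c in kids:
            if c != heavy:
                schedule(a, a_root, b, c, out)
        v = heavy
    # Step 3: record the pair instead of computing the distance.
    pair = {a.name: a_root, b.name: b_root}
    out.append((pair["T1"], pair["T2"]))


t1 = make_tree("T1", {"a": ["b", "c"]}, "a")
t2 = make_tree("T2", {"p": ["q", "r"], "q": ["s", "t"]}, "p")
order = []
schedule(t1, "a", t2, "p", order)
print(order)
# [('c', 'r'), ('a', 'r'), ('c', 't'), ('a', 't'), ('a', 'p')]
```

Each recorded pair appears only after all the pairs generated by its own recursive calls, so processing 
the list from front to back gives a bottom-up order of the subtree-subtree computations. 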

We now consider some aspects of the algorithm. 

Theorem 8. The algorithm computes $d(T_1[i], T_2[j])$ for all $1 \le i \le |T_1|$ and $1 \le j \le |T_2|$. 

Proof. We prove it by induction on the sizes of the subtrees induced by the keyroots. 

Base case: This involves only the singleton subtrees. Since all the basic edit costs with respect to single 
nodes are already defined, the base case holds. 

Induction hypothesis: For any $(i, j)$ with $i \in \text{H-keyroots}(T_1)$ and $j \in \text{H-keyroots}(T_2)$, 
after step 2, 

1. if $|T_1[i]| \le |T_2[j]|$, then $\{d(T_1[i'], T_2[j']) \mid i' \in T_1[i],\ j' \in T_2[j] - \text{heavy-path}(T_2[j])\}$ 
have been computed, 

2. if $|T_1[i]| > |T_2[j]|$, then $\{d(T_1[i'], T_2[j']) \mid i' \in T_1[i] - \text{heavy-path}(T_1[i]),\ j' \in T_2[j]\}$ 
have been computed. 

Induction step: We show that $\{d(T_1[i'], T_2[j']) \mid i' \in T_1[i],\ j' \in T_2[j]\}$ are all computed. The 
subtree-subtree distances to be computed in step 3 are: 

1. $\{d(T_1[i'], T_2[j']) \mid i' \in T_1[i],\ j' \in \text{heavy-path}(T_2[j])\}$, if $|T_1[i]| \le |T_2[j]|$, 

2. $\{d(T_1[i'], T_2[j']) \mid i' \in \text{heavy-path}(T_1[i]),\ j' \in T_2[j]\}$, if $|T_1[i]| > |T_2[j]|$. 

The induction step holds since it is in accord with the postorder that the algorithm follows, which means 
that all distances specified in the induction hypothesis have been computed. This concludes the proof. □ 

Given two subtree pairs $(T_1[k], T_2[k'])$ and $(T_1[l], T_2[l'])$ in which $(i, j)$ is contained, with 
$k, l \in \text{H-keyroots}(T_1)$, $k', l' \in \text{H-keyroots}(T_2)$, 
$k = \min\{x \mid x \in \text{ancestors}(l) \cap \text{H-keyroots}(T_1)\}$, and 
$k' = \min\{x \mid x \in \text{ancestors}(l') \cap \text{H-keyroots}(T_2)\}$, we consider the possibilities 
pertaining to the relative sizes of the four trees, where we use H-keyroots to represent the corresponding 
subtree sizes. We write $k \prec l$ if $|T[k]| > |T[l]|$. 

1. $k \prec l \prec k' \prec l'$, as in Figure 12(a), 

2. $k \prec k' \prec l \prec l'$, as in Figure 12(b), 

3. $k \prec k' \prec l' \prec l$, as in Figure 12(c). 


Figure 12 Depiction of possible cases for relative subtree sizes due to heavy-path decomposition. A line 
between two size levels (thick lines) indicates that a distance computation is performed for subtrees of 
corresponding size levels. Here, $k, l \in \text{H-keyroots}(T_1)$ and $k', l' \in \text{H-keyroots}(T_2)$. 
$T_1[k]$ and $T_2[k']$ decompose once to yield $T_1[l]$ and $T_2[l']$, respectively. 



The distance computations in which a pair of nodes $(i, j)$ would participate are represented by solid 
lines drawn between the size levels as shown in Figure 12. These situations arise because only the larger 
subtree is allowed to decompose. The number of enumeration steps involving $(i, j)$ is counted as follows. 
We enumerate each node in the larger subtree once, and each node in the smaller subtree a number of times 
no more than the size of that subtree. This way of counting for the smaller subtree is based on how many 
subforests with distinct leftmost roots may include the node as the rightmost root or, symmetrically, how 
many subforests with distinct rightmost roots may include the node as the leftmost root. For case 1, i.e., 
$k \prec l \prec k' \prec l'$, $i$ is counted once for $k$ and $l$ each, while $j$ is counted $|T_2[k']|$ steps 
for $k'$, for a total of $2|T_2[k']|$ steps. For case 2, i.e., $k \prec k' \prec l \prec l'$, $(i, j)$ is counted 
$1 \times |T_2[k']|$ steps for $(k, k')$, $|T_1[l]| \times 1$ steps for $(l, k')$, and $1 \times |T_2[l']|$ steps 
for $(l, l')$, for a total of $|T_2[k']| + |T_1[l]| + |T_2[l']|$ steps. For case 3, i.e., 
$k \prec k' \prec l' \prec l$, $(i, j)$ is counted $1 \times |T_2[k']|$ steps for $(k, k')$, $|T_1[l]| \times 1$ 
steps for $(l, k')$, and $|T_1[l]| \times 1$ steps for $(l, l')$, for a total of $|T_2[k']| + 2|T_1[l]|$ steps. 
In the time complexity analysis, the steps in case 3 can be bounded by replacing $l$ by $u$ where 
$k' \prec u \prec l'$, which results in the two lines incident to $l$ being replaced by the two lines 
incident to $u$, returning to case 2. This means that in the time complexity analysis we only need to 
consider steps from case 1 and case 2 as well as their symmetrical counterparts. Figure 13 illustrates a 
situation where $(i, j)$ are enumerated as a pair in the worst case (i.e., $1 + \log_2 m$ and 
$1 + \log_2 n$ levels, respectively) with respect to the sizes of the subtrees in which $(i, j)$ are 
contained. 

The following lemma is based on an observation that is crucial in obtaining the claimed time bound. 

Lemma 6. Let $W = \{w_1, w_2, \dots, w_k\}$ be a list of numbers satisfying that for any $w_i, w_j \in W$, 
$w_j \le \frac{w_i}{2}$ if $j = i + 1$. Then, for any $u$, $S = \sum_{i=u}^{k} w_i \le 2 w_u$. 

Proof. Recall that $R = \sum_{i=0}^{n} \frac{1}{2^i} \le 2$ for any $n \ge 0$, which is proved by showing that 
$2R - R = 2 - \frac{1}{2^n} < 2$. Therefore, we have 
$S = \sum_{i=u}^{k} w_i \le w_u \sum_{i=0}^{k-u} \frac{1}{2^i} \le 2 w_u$. □ 
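
As a quick numerical illustration of Lemma 6, the following sketch checks the bound on a hypothetical list 
of subtree sizes such as might occur along a chain of H-keyroots (by Lemma 4, each size is at most half of 
the previous one). The function names and the sample values are assumptions made for this example only. 

```python
def satisfies_halving(w):
    """True if every element is at most half of its predecessor."""
    return all(w[i + 1] <= w[i] / 2 for i in range(len(w) - 1))


def tail_sums_bounded(w):
    """True if, for every starting index u, the sum of w[u:] is at most 2 * w[u]."""
    return all(sum(w[u:]) <= 2 * w[u] for u in range(len(w)))


sizes = [100, 50, 20, 10, 3, 1]   # hypothetical subtree sizes along a keyroot chain
print(satisfies_halving(sizes))   # True
print(tail_sums_bounded(sizes))   # True
print(sum(sizes), 2 * sizes[0])   # 184 200
```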

The following theorem gives the result for the time complexity of the algorithm. 

Theorem 9. The tree edit distance problem can be solved in $O(m^2 n (1 + \log \frac{n}{m}))$ time where 
$|T_1| = m$, $|T_2| = n$, and $m \le n$. 

Proof. For any $(i, j)$ where $i \in T_1$ and $j \in T_2$, we count the number of times that $(i, j)$ is 
enumerated in distance computations in all possible combinations based on the relative sizes of the 
subtrees in which $i$ and $j$ are contained. These combinations can be divided into three categories: 

1. $(i, j) \in (T_1[h], T_2[h'])$ for some $(h, h')$, with $|T_1[h]| \le m$ and $m < |T_2[h']| \le n$, 

2. $(i, j) \in (T_1[h], T_2[h'])$ for some $(h, h')$, with $m \ge |T_1[h]| > |T_2[h']|$, 

3. $(i, j) \in (T_1[h], T_2[h'])$ for some $(h, h')$, with $|T_1[h]| \le |T_2[h']| \le m$. 

In the above cases, for each pair of nodes $(i, j)$ that participate in a distance computation for a pair of 
subtrees, the node in the larger subtree is counted once, while the node in the smaller subtree is counted 
a number of times no more than the size of the subtree. This way of counting with respect to the smaller 
subtree is based on how many subforests with distinct leftmost roots may include the node as the rightmost 
root, or symmetrically, how many subforests with distinct rightmost roots may include the node as the 
leftmost root. 

Let $S_1$, $S_2$, and $S_3$ be the maximal numbers of total enumeration steps corresponding to categories 1, 2, 
and 3, respectively. 

From Lemma 5, a node in $T$, with $|T| = n > m$, can be in at most $1 + \log_2 m$ subtrees of sizes no more 
than $m$, rooted at distinct H-keyroots. Therefore, $S_1 \le m^2 n (\log_2 n - \log_2 m) = m^2 n \log_2 \frac{n}{m}$. 

For $S_2$ and $S_3$, we give a simplified analysis which includes all combinations, some of which are 
redundant because a smaller subtree does not decompose until it becomes the larger one. This, however, does 
not change the complexity, as the difference is within a negligible factor due to Lemma 6. From Lemmas 2, 
5, and 6, we have $S_2 \le m \left(n \times 2m \sum_{i=0}^{\log_2 m} \frac{1}{2^i}\right) \le 4m^2 n$ and 
$S_3 \le \left(m \times 2m \sum_{i=0}^{\log_2 m} \frac{1}{2^i}\right) n \le 4m^2 n$. This yields a total number 
of steps in the worst case of $S_1 + S_2 + S_3 = O(m^2 n (1 + \log \frac{n}{m}))$. 

For a more accurate estimate of $S_2$ and $S_3$ (see Figure 13), we have 
$S_2 \le m \times nm + 2m \times \left(n \sum_{i=1}^{\log_2 m} \frac{m}{2^i}\right) \le 3m^2 n$ and 
$S_3 \le mm \times n + \left(m \sum_{i=1}^{\log_2 m} \frac{m}{2^i}\right) \times 2n \le 3m^2 n$. Hence, the total 
time is $S_1 + S_2 + S_3 = O(m^2 n (1 + \log \frac{n}{m}))$. □ 

Figure 13 Depiction of the situation where $(i, j)$ are enumerated as a pair in the worst case (i.e., 
$1 + \log_2 m$ and $1 + \log_2 n$ levels, respectively) with respect to the sizes of subtrees in which 
$(i, j)$ may be contained. Levels of different sizes are represented by thick lines. A line is drawn between 
two size levels to indicate inclusion of $(i, j)$, where an arrowhead points to the smaller size. For size 
levels no more than $m$, two types of arrowheads (filled and hollow) are used to distinguish between 
alternative sequences of decompositions, where the same sequence can be traced by following the lines with 
the same type of arrowhead. 

Remark: It has been shown that there exist trees for which $\Omega(m^2 n (1 + \log \frac{n}{m}))$ time is 
required to compute the distance no matter what strategy is used [4]. 

Theorem 10. The new algorithm solves the tree edit distance problem in $O(mn)$ space where $|T_1| = m$, 
$|T_2| = n$, and $m \le n$. 

Proof. The method is the same as that described in Theorem 7. □ 

The efficiency of the algorithm can be improved further by combining all three path-decomposition 
strategies (i.e., leftmost, rightmost, and heavy paths) to yield an algorithm with the least total number of 
enumeration steps. The basic idea, while retaining the general framework of the algorithm, is to 
recursively count the number of enumeration steps resulting from different types of path decompositions 
without actually carrying out the distance computations within the original algorithm. This means that for 
any $d(T_1[i], T_2[j])$ to be computed by the algorithm, step 3 counts the number of enumeration steps 
involving the nodes on the path for $(T_1[i], T_2[j])$ with respect to each type of decomposition, while the 
steps involving other nodes that do not belong to the path are counted recursively at step 2 (i.e., one 
recursive call for each type of path) and combined with the counts in step 3 so as to decide which path to 
use for that level. The results from each level are recorded in a table that consumes $O(mn)$ space. The 
recorded information is then used to guide the selection of strategy at each step of the distance 
computations. This yields the least total number of enumeration steps with respect to all strategies 
considered. The time complexity, however, remains the same due to the above remark regarding the lower 
bound. 


4 Conclusions 

This article considers the tree edit distance problem and the formulation of its solutions in the form of 
recursion. In particular, a class of algorithms based on closely related decomposition schemes for computing 
the tree edit distance between two ordered trees is reviewed, with attention to aspects of time complexity 
analysis. 

As a summary of the contents presented in Section 3, we recapitulate the related path-decomposition 
strategies as follows. 

Leftmost paths: $d(T_1, T_2)$ is computed as follows. 

1. Recursively, compute $d(T_1[k], T_2)$ and $d(T_1, T_2[k'])$, with $k$ ranging over the set of nodes 
connecting directly to leftmost-path($T_1$) with single edges, and $k'$ ranging over the set of nodes 
connecting directly to leftmost-path($T_2$) with single edges. 

2. Compute $d(T_1, T_2)$ by enumerating relevant subforests of $T_1$ and $T_2$ in LR-postorder. 

Heavy paths on one tree: $d(T_1, T_2)$ is computed as follows. 

1. Recursively, compute $d(T_1, T_2[k])$ with $k$ ranging over the set of nodes connecting directly to 
heavy-path($T_2$) with single edges. 

2. Compute $d(T_1, T_2)$ by enumerating relevant subforests of $T_1$ in prefix-suffix or suffix-prefix 
postorder, and relevant subforests of $T_2$ in H-postorder. 

Heavy paths on both trees: $d(T_1, T_2)$ is computed as follows, assuming that $|T_1| \le |T_2|$. 

1. If $|T_1| > |T_2|$, compute $d(T_2, T_1)$. 

2. Recursively, compute $d(T_1, T_2[k])$ with $k$ ranging over the set of nodes connecting directly to 
heavy-path($T_2$) with single edges. 

3. Compute $d(T_1, T_2)$ by enumerating relevant subforests of $T_1$ in prefix-suffix or suffix-prefix 
postorder, and relevant subforests of $T_2$ in H-postorder. 

All of the above strategies can be equivalently stated as applying Equation 2 according to predefined 
directions without recursing into subproblems already computed. 